Snowflake Bulk Ingest with Storage Integration

Big data workloads often require loading large volumes of data in batches, and bulk ingestion is built for exactly this scenario. When used in the data integration stage, Snowflake Bulk Ingest helps you load batches of data from files available in a data lake such as Amazon S3. Each pipeline run ingests a chunk of data: the data is first pushed into a landing layer and then sent to the unification layer.

Calibo's Data Pipeline Studio (DPS) supports bulk ingestion of data using Snowflake Bulk Ingest in the data integration stage, Amazon S3 as the data lake in the data source stage, and Snowflake as the target data lake. A typical Snowflake bulk ingest pipeline looks like this: Amazon S3 (data source) > Snowflake Bulk Ingest (data integration) > Snowflake (target data lake).

 

How Snowflake Bulk Ingest works

With Snowflake Bulk Ingest, every run of the data pipeline ingests data from S3 into the data lake. During ingestion, the data is first pushed into a landing layer, where an append, overwrite, or merge operation is performed depending on the use case. The processed data is then pushed into the unification layer. This process requires credentials to access the Amazon S3 bucket, as well as read and write permissions on the Snowflake objects. Instead of sharing credentials directly, you can use a storage integration.
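To make the pattern concrete, the following sketch shows the landing-to-unification flow using the snowflake-connector-python package. It is only an illustration of the flow described above, not the job that DPS generates for you: the stage S3_STAGE, the tables LANDING.ORDERS and UNIFIED.ORDERS, and the connection placeholders are all hypothetical names.

    import snowflake.connector

    # Connection placeholders: replace with your own account details.
    conn = snowflake.connector.connect(
        account="<account_identifier>",
        user="<user>",
        password="<password>",
        warehouse="<warehouse>",
        database="<database>",
    )
    cur = conn.cursor()

    # Step 1: land the raw files from S3 into a landing table through an
    # external stage (S3_STAGE is assumed to point at the source bucket).
    cur.execute("""
        COPY INTO LANDING.ORDERS
        FROM @S3_STAGE/orders/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)

    # Step 2: push the processed data into the unification layer. A merge
    # is shown here; append and overwrite are the other common options.
    cur.execute("""
        MERGE INTO UNIFIED.ORDERS AS t
        USING LANDING.ORDERS AS s
          ON t.ORDER_ID = s.ORDER_ID
        WHEN MATCHED THEN UPDATE SET t.AMOUNT = s.AMOUNT, t.STATUS = s.STATUS
        WHEN NOT MATCHED THEN INSERT (ORDER_ID, AMOUNT, STATUS)
          VALUES (s.ORDER_ID, s.AMOUNT, s.STATUS)
    """)

    cur.close()
    conn.close()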

 

What is Storage Integration?

A storage integration is a Snowflake object that lets Snowflake connect to your AWS account using the AWS Identity and Access Management (IAM) service, so you do not have to share access credentials directly. You can also specify allowed and blocked storage locations, which adds another layer of security to the entire data ingestion operation.
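For reference, a storage integration and an external stage that uses it can be created with statements along the following lines. This is only a sketch, not the exact objects that the Lazsa Platform or DPS creates; the integration name, role ARN, and bucket paths are placeholders to replace with values from your own environment.

    import snowflake.connector

    # Creating an integration requires ACCOUNTADMIN or a role that has been
    # granted the CREATE INTEGRATION privilege.
    conn = snowflake.connector.connect(
        account="<account_identifier>",
        user="<user>",
        password="<password>",
        role="ACCOUNTADMIN",
    )
    cur = conn.cursor()

    # The integration delegates access to an IAM role, so no AWS keys are
    # stored in Snowflake or shared with the pipeline. Allowed and blocked
    # locations restrict which S3 paths the integration can reach.
    cur.execute("""
        CREATE STORAGE INTEGRATION IF NOT EXISTS S3_BULK_INGEST_INT
          TYPE = EXTERNAL_STAGE
          STORAGE_PROVIDER = 'S3'
          ENABLED = TRUE
          STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<aws_account_id>:role/<snowflake_access_role>'
          STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket>/landing/')
          STORAGE_BLOCKED_LOCATIONS = ('s3://<bucket>/restricted/')
    """)

    # An external stage bound to the integration; COPY INTO statements can
    # then reference the stage instead of raw credentials.
    cur.execute("""
        CREATE STAGE IF NOT EXISTS S3_STAGE
          URL = 's3://<bucket>/landing/'
          STORAGE_INTEGRATION = S3_BULK_INGEST_INT
    """)

    cur.close()
    conn.close()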

Prerequisites for using Snowflake Bulk Ingest in the data integration stage:

  • You must have Amazon S3 and Snowflake data lakes configured in the Lazsa Platform.

  • You must have a storage integration created in Snowflake. (A quick way to verify this is sketched below.)
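If you want to confirm that the storage integration exists and retrieve the values that Snowflake expects in the IAM role's trust policy, a check along these lines can help. The integration name S3_BULK_INGEST_INT and the connection placeholders are assumptions carried over from the sketch above.

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="<account_identifier>",
        user="<user>",
        password="<password>",
    )
    cur = conn.cursor()

    # DESC STORAGE INTEGRATION returns property/value rows; the IAM user ARN
    # and external ID below belong in the AWS role's trust relationship.
    cur.execute("DESC STORAGE INTEGRATION S3_BULK_INGEST_INT")
    props = {row[0]: row[2] for row in cur.fetchall()}
    print(props.get("STORAGE_AWS_IAM_USER_ARN"))
    print(props.get("STORAGE_AWS_EXTERNAL_ID"))

    cur.close()
    conn.close()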

     

To create a data integration job for Snowflake bulk ingest

  1. On the home page of DPS, add the following stages to create the SF Bulk Ingest Storage Integration pipeline:

    1. Data Lake: Amazon S3

    2. Data Integration: Snowflake Bulk Ingest

    3. Data Lake: Snowflake

  2. Configure the Amazon S3 and Snowflake nodes.

  3. Click the data integration node, and then click Create Job.

  4. To create the data integration job, provide the following inputs:

 

What's next? Data Integration using AWS Glue